The difference between two classifiers (algorithms) can be very small; however, no two classifiers have perfectly equivalent accuracies.
In a null hypothesis significance test (NHST), the null hypothesis is that the classifiers are equal. However, the null hypothesis is practically always false! By rejecting the null hypothesis, NHST indicates that the null hypothesis is unlikely; but this is known even before running the experiment.
Can we say anything about the probability that two classifiers are practically equivalent (e.g., j48 is practically equivalent to j48gr)?
NHST cannot answer this question, while Bayesian analysis can.
We need to define the meaning of practically equivalent.
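The rope (region of practical equivalence) depends on the metric used to compare the classifiers and on what counts as negligible in the application: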
Accuracy is a number in $[0,1]$. For practical applications, it is sensible to define that two classifiers whose mean difference of accuracies is less than $1\%$ ($0.01$) are practically equivalent. A difference of accuracy of $1\%$ is negligible in practice.
The interval $[-0.01,0.01]$ can thus be used to define a region of practical equivalence for classifiers.
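As a rough illustration of how a rope partitions paired differences, the sketch below counts how many differences fall to the left of, inside, and to the right of the interval $[-0.01,0.01]$. The helper `count_regions` and the accuracy values are made up for this example; they are not part of any library and are not taken from the data used later.

```python
import numpy as np

def count_regions(diff, rope):
    """Count differences left of, within, and right of [-rope, rope]."""
    diff = np.asarray(diff, dtype=float)
    n_left = int(np.sum(diff < -rope))    # clearly negative differences
    n_right = int(np.sum(diff > rope))    # clearly positive differences
    n_rope = diff.size - n_left - n_right # everything else is inside the rope
    return n_left, n_rope, n_right

# Hypothetical accuracies of two classifiers on five data sets
acc_a = np.array([0.80, 0.91, 0.75, 0.66, 0.88])
acc_b = np.array([0.805, 0.912, 0.78, 0.66, 0.85])

# Differences are taken as second minus first
counts = count_regions(acc_b - acc_a, rope=0.01)
print(counts)
```

Only the differences that leave the rope count as evidence for one classifier or the other; the small differences are treated as ties.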
See it in action.
We load the classification accuracies of J48 and J48gr on 54 UCI datasets from the file Data/accuracy_j48_j48gr.csv. For simplicity, we will skip the header row and the column with data set names.
In [2]:
import numpy as np
scores = np.loadtxt('Data/accuracy_j48_j48gr.csv', delimiter=',', skiprows=1, usecols=(1, 2))
names = ("J48", "J48gr")
Function signtest(x, rope, prior_strength=1, prior_place=ROPE, nsamples=50000, verbose=False, names=('C1', 'C2'))
computes the Bayesian sign test and returns the probabilities that the difference (the score of the second classifier minus the score of the first) is negative, within the rope, or positive.
In [3]:
import bayesiantests as bt
left, within, right = bt.signtest(scores, rope=0.01,verbose=True,names=names)
The first value (P(J48 > J48gr)) is the probability that the first classifier (the left column of x) has a higher score than the second (or that the differences are negative, if x is given as a vector).
The second value (P(rope)) is the probability that they are practically equivalent.
The third value (P(J48gr > J48)) is equal to 1-P(J48 > J48gr)-P(rope).
The probability of the rope is equal to $1$; therefore, we can say that the two classifiers are practically equivalent (for the given rope).
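To give an intuition for where three such probabilities can come from, here is a minimal sketch of a sign-test-style computation. It assumes a Dirichlet posterior over the three regions (left, rope, right), with the prior entering as one pseudo-observation placed in a chosen region; this is an assumption of the sketch and not necessarily how bayesiantests implements signtest. The function name `sign_test_sketch`, its `seed` argument, and the differences are hypothetical.

```python
import numpy as np

def sign_test_sketch(diff, rope, prior_strength=1.0, prior_place=1,
                     nsamples=50000, seed=0):
    """Monte Carlo estimate of (P(left), P(rope), P(right)).

    prior_place is the index that receives the prior pseudo-count:
    0 = left, 1 = rope, 2 = right.
    """
    diff = np.asarray(diff, dtype=float)
    n_left = np.sum(diff < -rope)
    n_right = np.sum(diff > rope)
    n_rope = diff.size - n_left - n_right

    # Dirichlet parameters: observed counts plus the prior pseudo-count.
    # The small constant keeps every parameter strictly positive.
    alpha = np.array([n_left, n_rope, n_right], dtype=float) + 1e-4
    alpha[prior_place] += prior_strength

    # Sample region probabilities and record which region is most probable
    rng = np.random.default_rng(seed)
    samples = rng.dirichlet(alpha, size=nsamples)
    winners = np.argmax(samples, axis=1)
    return np.bincount(winners, minlength=3) / nsamples

# Hypothetical differences: most lie inside the rope, a few outside
diff = np.concatenate([np.full(25, 0.002), np.full(3, 0.02), np.full(2, -0.02)])
probs = sign_test_sketch(diff, rope=0.01)
print(probs)
```

Because every sample is assigned to exactly one region, the three estimates always sum to 1, which is why the third value can be obtained from the other two.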
Decision tree grafting (J48gr) was developed to demonstrate that a preference for less complex trees (J48) does not serve to improve accuracy. The point is that J48gr shows a consistent (albeit small) improvement in accuracy over J48.
The advantage of having a rope is that we can test this hypothesis from a statistical point of view.
In [10]:
left, within, right = bt.signtest(scores, rope=0.001,verbose=True,names=names)
Even with this narrower rope, the difference is less than 0.001 with probability 0.99.
In [11]:
left, within, right = bt.signtest(scores, rope=0.0001,verbose=True,names=names)
The difference is therefore in the order of 0.0001. The difference is very small (around 1/10000), but in favour of J48gr.
In [12]:
left, within, right = bt.signtest(scores, rope=0.0001,prior_place=bt.RIGHT,verbose=True,names=names)
In this case the conclusions are somewhat sensitive to the prior (the posterior probabilities change by about 0.05). However, the overall conclusion does not change much. The difference is very small (around 1/10000), but in favour of J48gr.
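One way to reason about the size of such a prior effect, assuming a Dirichlet posterior over the three regions with the prior acting as a single pseudo-observation (an assumption of this sketch, not a description of the library's internals), is to compare posterior means under two prior placements. The counts below are hypothetical and are not meant to reproduce the 0.05 change reported here; the constants `LEFT`, `ROPE`, `RIGHT` and the helper `posterior_mean` are local to the example.

```python
import numpy as np

# Local index constants for the three regions (hypothetical, for this sketch only)
LEFT, ROPE, RIGHT = 0, 1, 2

# Hypothetical counts of differences (left, rope, right) over 54 data sets
counts = np.array([2.0, 40.0, 12.0])

def posterior_mean(counts, prior_place, prior_strength=1.0):
    """Mean of a Dirichlet whose parameters are counts plus one prior pseudo-count."""
    alpha = counts.copy()
    alpha[prior_place] += prior_strength
    return alpha / alpha.sum()

mean_rope_prior = posterior_mean(counts, ROPE)
mean_right_prior = posterior_mean(counts, RIGHT)

# Moving the pseudo-observation from the rope to the right shifts
# 1 / (n + prior_strength) of probability mass between the two regions.
shift = mean_right_prior - mean_rope_prior
print(shift)
```

With 54 observations the pseudo-observation moves the posterior mean by 1/55, so the prior can matter when two regions are nearly tied but fades as more data sets are added. Note that the posterior mean is not the same quantity as the probabilities printed by signtest; it is used here only to make the arithmetic transparent.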
In [88]:
%matplotlib inline
import matplotlib.pyplot as plt
# Rope widths 0.001, 0.001/2, ..., 0.001/2**9
ropes = 0.001 / 2**np.arange(10)
left = np.zeros(10)
within = np.zeros(10)
right = np.zeros(10)
# Run the sign test once per rope width and store the three probabilities
for i, rope in enumerate(ropes):
    left[i], within[i], right[i] = bt.signtest(scores, rope=rope, names=names)
plt.plot(ropes, within)
plt.plot(ropes, left)
plt.plot(ropes, right)
plt.legend(('rope', 'left', 'right'))
plt.xlabel('Rope width')
plt.ylabel('Probability')
Out[88]:
In [13]:
left, within, right = bt.signrank(scores, rope=0.001,verbose=True,names=names)
In [14]:
left, within, right = bt.signrank(scores, rope=0.0001,verbose=True,names=names)
With the Bayesian signed-rank test (signrank), the conclusion is very similar: the difference is very small (around 1/10000), but in favour of J48gr.